A Phonetic Morpheme Lexicon for German
Abstract
The availability of computerized lexical data is growing. In spite of this fact, few resources are available for the minimal functional units of language: morphemes. For German, several morpheme lexica provide morphemes in orthographical representation, but only one of them provides this information in a machine-readable form. Yet no resource is available that lists the complete morpheme inventory of German in a phonetical representation. This paper argues in favour of phonetic morpheme databases for phonetic research and speech applications. Procedures for the development of a database for German are described and first results from analyses are reported.

1. MOTIVATION

The most common conventional technique for the storage of texts is its orthographic representation. The standard minimal semantic units used in orthographic representations are words. Thus, these units are most often used as the basic entities for providing information on the correspondence between orthographical and acoustical manifestations of language: pronunciation dictionaries represent words in both orthographical and phonetical form. The provision of access to the symbolic link between these two representations is an important factor for many applications in speech technology.

From the viewpoint of semantics, morphemes must be considered the smallest sign units. Yet these basic semantic units of speech remain opaque for technical applications and linguistic research: for no language is there a complete database that lists the phonetical description of its morphemes. For German, many symbolic databases are available, but only a few deal with morphological units ([1,2,3]) and even fewer (only [3], for a small subset) represent phonetic information of German morphemes.

2. WORD VS. MORPHEME

The technical value of a morpheme inventory may strongly depend on the structure of the specific language it is produced for. German is a language for which setting up a morpheme database seems very promising, since German extensively uses derivational and compositional processes. German is very productive: most words can be split into two or more morphemes.

Compared to databases which provide information on whole words and their corresponding transcriptions, a morpheme-based database is easier to maintain. First of all, there are fewer entries, and any change to be applied to many entries will thus need less effort. Second, since German is so productive, a list of words for this language will never be complete and virtually never be up to date. Moreover, a word database will include many entries with very low frequencies. A morpheme database, on the other hand, is subject to only little fluctuation. Thus, with a morpheme inventory at hand, a higher coverage of German words can be obtained.

3. PURPOSES

A phonetical morpheme database can be used both in application-oriented areas and in basic research. The most obvious technical applications are text-to-speech synthesis (TTS) and automatic speech recognition (ASR).

For TTS, three approaches to transcription, and their combinations, are possible: a transcription database of complete words, a morphological analysis, and a rule-based letter-to-sound (LTS) algorithm. With a list of complete words and their transcriptions, new words cannot be processed; a pure LTS approach can process [...] analysis will be superior to approaches that do without.

For morpheme-based transcription algorithms, mainly two methods can be distinguished. In both cases an orthographical database of morphemes and a model for their combination in words are needed. The approaches differ in the intermediate step to transcription: either the morphemes recognized are transcribed by an LTS algorithm (transcription approach), or their phonetic transcription is provided, i.e. has been previously transcribed, in the same database and only looked up (lookup approach). The last step is equivalent again: portions of the words that could not be analyzed are transcribed by LTS rules, and the transcriptions are combined into one phonetic representation of the orthographic word.

For the development of a high-quality LTS algorithm, be it a self-learning approach or a set of explicit rules, a database has to be used that serves as training material (self-learning method) or as evaluation database (explicit rules). Thus, any LTS approach requires a transcription database and as its output has to be tested for a finite [...]

Finally, a static database will be easier to maintain than any complex rule-based LTS system, which will always need an exception list for those entries that are defined as extreme deviations from the standard, as it would be too costly to devise special rules for them.

In the field of ASR, and for the application of HMMs on many levels, it would be advantageous to add another intermediate layer between the levels of words and sounds. Adding the level of morphology would constrain the number of possible candidates and thus yield smaller error rates in speech recognition. For the training of the relevant models, the training material (orthographical words and their corresponding phonetical representations) would have to be segmented and aligned accordingly. This material could be produced in either of the ways described above for the morphology-based approach to phonetic transcription.
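
The lookup approach outlined in section 3 can be illustrated with a small sketch. It is a minimal Python example under stated assumptions, not the procedure used for the database described here: the lexicon entries, their SAMPA-like transcriptions, the greedy longest-match segmentation, and the names `MORPHEME_LEXICON`, `toy_lts`, `transcribe` and `phonetic_form` are all hypothetical. Recognized morphemes are looked up, portions that could not be analyzed fall back to a placeholder letter-to-sound step, and the pieces are then combined into one representation.

```python
# Toy morpheme lexicon: orthographic form -> phonetic form.
# Entries and SAMPA-like transcriptions are illustrative assumptions.
MORPHEME_LEXICON = {
    "haus": "haUs",
    "tür": "ty:6",
    "kind": "kInt",
    "lich": "lIC",
}

# Placeholder letter-to-sound table (hypothetical): one symbol per letter.
TOY_LTS = {"s": "s", "e": "@", "n": "n"}


def toy_lts(fragment):
    """Transcribe an unanalysed fragment letter by letter."""
    return "".join(TOY_LTS.get(ch, ch) for ch in fragment)


def transcribe(word, lexicon=MORPHEME_LEXICON):
    """Segment `word` by greedy longest match and look up each morpheme.

    Returns (orthographic_piece, transcription, source) tuples, where
    source is "lookup" for lexicon hits and "lts" for the fallback.
    """
    word = word.lower()
    pieces = []
    unknown = ""  # letters not covered by any lexicon entry so far
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            candidate = word[i:j]
            if candidate in lexicon:
                if unknown:
                    pieces.append((unknown, toy_lts(unknown), "lts"))
                    unknown = ""
                pieces.append((candidate, lexicon[candidate], "lookup"))
                i = j
                break
        else:
            # No entry starts here: defer this letter to the LTS fallback.
            unknown += word[i]
            i += 1
    if unknown:
        pieces.append((unknown, toy_lts(unknown), "lts"))
    return pieces


def phonetic_form(word):
    """Combine the piece transcriptions into one representation."""
    return "".join(transcription for _, transcription, _ in transcribe(word))
```

With this toy lexicon, `transcribe("Haustür")` yields two looked-up pieces, `haus` and `tür`, while the ending of `Kindes` is routed to the fallback. Plain concatenation is a simplification: the sketch does not model any sound changes at morpheme boundaries, and greedy matching stands in for the combination model that the text says is needed.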